Long Code Arena: a Set of Benchmarks for Long-Context Code Models

Bogomolov, Egor, Eliseeva, Aleksandra, Galimzyanov, Timur, Glukhov, Evgeniy, Shapkin, Anton, Tigina, Maria, Golubev, Yaroslav, Kovrigin, Alexander, van Deursen, Arie, Izadi, Maliheh, Bryksin, Timofey

arXiv.org Artificial Intelligence

Nowadays, the fields of code and natural language processing are evolving rapidly. In particular, models are becoming better at processing long context windows -- supported context sizes have increased by orders of magnitude over the last few years. However, there is a shortage of benchmarks for code processing that go beyond a single file of context, and the most popular ones are limited to a single method. With this work, we aim to close this gap by introducing Long Code Arena, a suite of six benchmarks for code processing tasks that require project-wide context. These tasks cover different aspects of code processing: library-based code generation, CI build repair, project-level code completion, commit message generation, bug localization, and module summarization. For each task, we provide a manually verified dataset for testing, an evaluation suite, and open-source baseline solutions based on popular LLMs to showcase the usage of the dataset and to simplify adoption by other researchers.


GECKO: Generative Language Model for English, Code and Korean

Oh, Sungwoo, Kim, Donggyu

arXiv.org Artificial Intelligence

We introduce GECKO, a bilingual large language model (LLM) optimized for Korean and English, along with programming languages. GECKO is pretrained on a balanced, high-quality corpus of Korean and English using the LLaMA architecture. In this report, we share our experiences building a better data pipeline for the corpus and training the model. GECKO shows great efficiency in token generation for both Korean and English, despite its small vocabulary size. We measure performance on representative benchmarks for Korean, English, and code: the model exhibits great performance on KMMLU (Korean MMLU) and modest performance in English and code, even with fewer trained tokens than English-focused LLMs. GECKO is available to the open-source community under a permissive license. We hope our work offers a research baseline and practical insights for Korean LLM research. The model can be found at: https://huggingface.co/kifai/GECKO-7B


The Stack: 3 TB of permissively licensed source code

Kocetkov, Denis, Li, Raymond, Allal, Loubna Ben, Li, Jia, Mou, Chenghao, Ferrandis, Carlos Muñoz, Jernite, Yacine, Mitchell, Margaret, Hughes, Sean, Wolf, Thomas, Bahdanau, Dzmitry, von Werra, Leandro, de Vries, Harm

arXiv.org Artificial Intelligence

Large Language Models (LLMs) play an ever-increasing role in the field of Artificial Intelligence (AI)--not only for natural language processing but also for code understanding and generation. To stimulate open and responsible research on LLMs for code, we introduce The Stack, a 3.1 TB dataset consisting of permissively licensed source code in 30 programming languages. We describe how we collect the full dataset, construct a permissively licensed subset, present a data governance plan, discuss limitations, and show promising results on text2code benchmarks by training 350M-parameter decoders on different Python subsets. We find that (1) near-deduplicating the data significantly boosts performance across all experiments, and (2) it is possible to match previously reported HumanEval and MBPP performance using only permissively licensed data. We make the dataset available at https://hf.co/BigCode, provide a tool called "Am I in The Stack" (https://hf.co/spaces/bigcode/in-the-stack) for developers to search The Stack for copies of their code, and provide a process for code to be removed from the dataset by following the instructions at https://www.bigcode-project.org/docs/about/the-stack/.
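The near-deduplication finding above can be illustrated with a minimal sketch. This is not the authors' pipeline (which runs at terabyte scale, typically with MinHash-based approximations); it is a toy version of the same idea, assuming character-shingle Jaccard similarity as the notion of "near-identical":

```python
def shingles(text, k=5):
    """Set of character k-shingles of a source file."""
    return {text[i:i + k] for i in range(max(1, len(text) - k + 1))}

def jaccard(a, b):
    """Jaccard similarity of two shingle sets."""
    return len(a & b) / len(a | b) if a | b else 1.0

def near_dedup(files, threshold=0.85):
    """Keep a file only if it is not near-identical to one already kept."""
    kept = []
    for text in files:
        s = shingles(text)
        if all(jaccard(s, shingles(other)) < threshold for other in kept):
            kept.append(text)
    return kept
```

On a corpus with many copied files, a filter like this shrinks the training set while keeping one representative of each near-duplicate cluster -- the effect the paper reports as boosting downstream performance.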


What is transfer learning and why is it needed?

#artificialintelligence

In this article, we will be discussing a state-of-the-art technique for building complex deep learning models using pre-trained models. We humans have an inherent ability to transfer our knowledge across different tasks. We utilize the knowledge that we acquire from one task to solve other similar tasks. The more related the task, the easier it is for us to transfer or cross-utilize our knowledge. Let's understand this with some examples: Conventional machine learning and deep learning algorithms are designed to work in isolation.


GitHub's Commercial AI Tool Was Built From Open Source Code

#artificialintelligence

Earlier this month, Armin Ronacher, a prominent open-source developer, was experimenting with a new code-generating tool from GitHub called Copilot when it began to produce a curiously familiar stretch of code. The lines, drawn from the source code of the 1999 video game Quake III, are infamous among programmers--a combo of little tricks that add up to some pretty basic math, imprecisely. The original Quake coders knew they were hacking. "What the fuck," one commented in the code beside an especially egregious shortcut. So it was strange for Ronacher to see such code generated by Copilot, an artificial intelligence tool that is marketed to generate code that is both novel and efficient.
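The Quake III routine in question is widely known as the fast inverse square root: it reinterprets a float's bits as an integer, applies a "magic" constant, and refines the result with one Newton step. As an illustration of the trick the article alludes to, here is a Python port of the bit-level hack (the original is C; the constant 0x5F3759DF and the single Newton iteration come from the published Quake III source):

```python
import struct

def fast_inv_sqrt(x):
    """Approximate 1/sqrt(x) for positive x via the Quake III bit trick."""
    # Reinterpret the float's bits as a 32-bit integer.
    i = struct.unpack('>l', struct.pack('>f', x))[0]
    # The infamous "magic number" step: a crude log-domain estimate.
    i = 0x5F3759DF - (i >> 1)
    y = struct.unpack('>f', struct.pack('>l', i))[0]
    # One iteration of Newton's method sharpens the estimate.
    return y * (1.5 - 0.5 * x * y * y)
```

The result is only approximate -- within a fraction of a percent after the Newton step -- which is exactly the "pretty basic math, imprecisely" the article describes.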


Linux Foundation unveils new permissive license for open data collaboration - JackOfAllTechs.com

#artificialintelligence

The Linux Foundation has announced a new permissive license designed to help foster collaboration around open data for artificial intelligence (AI) and machine learning (ML) projects. It has often been said that data is the new oil, but for AI and ML projects in particular, having access to expansive and diverse data sets is key to reducing bias and building powerful models capable of all manner of intelligent tasks. To machines, data is a little like "experience" is to humans -- the more of it you have, the better decisions you are likely to make. With CDLA-Permissive-2.0, the Linux Foundation is building on its previous efforts to encourage data-sharing through licensing arrangements that clearly define how the data -- and any derivative data sets -- can and can't be used. The Linux Foundation first introduced the Community Data License Agreement (CDLA) back in 2017 to entice organizations to open up their vast pools of (underused) data to third parties.


Transfer Learning for Deep Learning: Pre-trained models to save training time and cost

#artificialintelligence

Training a neural network has long posed problems for researchers and developers. Two major problems arise during the development of a DL-based solution: the astronomical cost of training, and the time required to train the network. Since training a neural network involves numerous matrix operations and demands high computational capability, the cost of operation escalates if one needs to repeat a similar process for another model. The time to train also grows at an exponential rate as networks get deeper and more complicated. Using GPUs is one effective way to speed up the process.


Inside the 1TB ImageNet data set used to train the world's AI: Nude kids, drunken frat parties, porno stars, and more

#artificialintelligence

Special report ImageNet – a data set used to train AI systems around the world – contains photos of naked children, families on the beach, college parties, porn actresses, and more, scraped from the web to train computers without those individuals' explicit consent. The library consists of 14 million images, each placed into categories that describe what's pictured in each scene. This pairing of information – images and labels – is used to teach artificially intelligent applications to recognize things and people caught on camera. The database has been downloaded by boffins, engineers, and academics to train hundreds if not thousands of neural networks to identify stuff in photos – from assault rifles and aprons to magpies and minibuses to zebras and zucchinis, and everything in between. In 2012, the data set was used to build AlexNet, heralded as a breakthrough development in deep learning since it marked the first time a neural network outperformed traditional computational methods at object recognition in terms of accuracy.


A Gentle Introduction to Transfer Learning for Deep Learning - Machine Learning Mastery

#artificialintelligence

Transfer learning is a machine learning method where a model developed for one task is reused as the starting point for a model on a second task. It is a popular approach in deep learning, where pre-trained models are used as the starting point for computer vision and natural language processing tasks, given the vast compute and time resources required to develop neural network models for these problems and the huge jumps in skill they provide on related problems. In this post, you will discover how you can use transfer learning to speed up training and improve the performance of your deep learning model. A Gentle Introduction to Transfer Learning with Deep Learning Photo by Mike's Birds, some rights reserved.
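The reuse described above can be sketched in a few lines. This is a toy illustration, not any particular framework's API: a "pretrained" feature extractor (weights assumed to come from a source task) is frozen, and only a small new head is trained on the target task:

```python
import math

# Weights of a "pretrained" layer, learned on a source task and now frozen.
PRETRAINED_W = [[0.9, -0.2], [0.1, 0.8]]

def features(x):
    """Frozen feature extractor: its weights are reused, never updated."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in PRETRAINED_W]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_head(data, lr=0.5, epochs=200):
    """Train only a new linear head on top of the frozen features."""
    w, b = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for x, y in data:
            f = features(x)
            err = sigmoid(sum(wi * fi for wi, fi in zip(w, f)) + b) - y
            w = [wi - lr * err * fi for wi, fi in zip(w, f)]
            b -= lr * err
    return w, b

def predict(x, w, b):
    return 1 if sigmoid(sum(wi * fi for wi, fi in zip(w, features(x))) + b) > 0.5 else 0
```

Because only the two head weights and the bias are updated, training is far cheaper than learning the full network from scratch -- the speed-up the post promises, in miniature.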